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ABSTRACT 


is 


We develop latent variable models for Bayesian learning 
based low-rank matrix completion and reconstruction from 
linear measurements. For under-determined systems, the de¬ 
veloped methods are shown to reconstruct low-rank matrices 
when neither the rank nor the noise power is known a-priori. 
We derive relations between the latent variable models and 
several low-rank promoting penalty functions. The relations 
justify the use of Kronecker structured covariance matrices in 
a Gaussian based prior. In the methods, we use evidence ap¬ 
proximation and expectation-maximization to learn the model 
parameters. The performance of the methods is evaluated 
through extensive numerical simulations. 

1. INTRODUCTION 

Reconstruction of a high dimensional low-rank matrix from a 
low dimensional measurement vector is a challenging prob¬ 
lem. The low-rank matrix reconstruction (LRMR) prob¬ 
lem is inherently under-determined and have been receiving 
considerable attention 00 due to its generality over popular 
sparse reconstruction problems along with many application 
scopes 00 - Here we consider the LRMR system model 

y = Avec(X) + n (1) 

where y £ R m is the measurement vector, A £ R mxp9 is 
the linear measurement matrix, X £ R pX9 is the low-rank 
matrix, n £ R m is additive noise (typically assumed to be 
zero-mean Gaussian with covariance Cov(n) = /3 -1 I m ) and 
vec(-) is the vectorization operator. With m < pq , the setup is 
underdetermined and the task is the reconstruction (or estima¬ 
tion) of X from y. To deal with the underdetermined setup, 
a typical and much used strategy is to use a regularization in 
the reconstruction cost function. Regularization brings in the 
information about low rank priors. A typical type I estimator 
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X = argmin/3||y- Avec(X)|| 2 +p(X), (2) 

where j3 > 0 is a regularization parameter and g(-) is a fixed 
penalty function that promotes low rank in X. Common low- 
rank penalties in the literature 000 are 

<?(X) = ||X||* = tr((XX T ) 1//2 ), (nuclear norm) 

g(X) = tr((XX T ) s / 2 ), (Schatten s-norm) 

g(X) = log |XX T + el p \, (log-determinant penalty) 


where tr(-) denotes the matrix trace, | • | denotes determinant, 
and 0 < s < 1 and e > 0. We mention that the nuclear norm 
penalty is a convex function. 

In the literature, LRMR algorithms can be categorized in 


10], greedy solu- 
41. Most of these 


three types: convex optimization 001 ! 
tions ]4||7| TTj-13] and Bayesian learning 
existing algorithms are highly motivated from analogous al¬ 
gorithms used for standard sparse reconstruction problems, 
such as compressed sensing where vec(X) in ([T]) is replaced 
by a sparse vector. Using convex optimization we can solve 
|2) when g(X) is the nuclear norm, which is an analogue 
of using £| -norm in sparse reconstruction problems. Fur¬ 
ther, greedy algorithms, such as iteratively reweighted least 
squares |[7]| solves Q by using algebraic approximations. 
While convex optimization and greedy solutions are popular 
they often need more a-priori information than knowledge 
about structure of the signal under reconstruction; for exam¬ 
ple, convex optimization algorithms need information about 
the strength of the measurement noise to fix the parameter 
/3, and greedy algorithms need information about rank. In 
absence of such a-priori information, Bayesian learning is 
a preferred strategy to use. Bayesian learning is capable 
of estimating the necessary information from measurement 
data. In Bayesian learning we evaluate the posterior p(X|y) 
with the knowledge of prior p(X). If X has a prior dis¬ 
tribution p(X) oc and the noise is distributed as 

n ~ A/”(0,/3 _1 I m ), then the maximum-a-posteriori (MAP) 
estimate can be interpreted as the type I estimate in ([2]). As 
type I estimation requires more information (such as /?), type 






II estimators are often more useful. Type II estimation tech¬ 
niques use hyper-parameters in the form of latent variables 
with prior distributions. While for sparse reconstruction prob¬ 
lems, Bayesian learning via type II estimation in the form of 
relevance vector machine 030 and sparse Bayesian learn¬ 
ing 030?) have gained significant popularity, the endeavor 
to design type II estimation algorithms for LRMR is found 
to be limited. In 03’ direct use of sparse Bayesian learn¬ 
ing was used to realize an LRMR reconstruction algorithm. 
Bayesian approaches were used in 119 201 for a problem 
setup with a combination of low rank and sparse priors, 
called principal component pursuit J2T) . In 1201, Gaussian 
and Bernoulli variables was used and the parameters were 
estimated using Markov Chain Monte Carlo while in (24) 
an empirical Bayesian approach was used. Type II estima¬ 
tion methods are typically iterative where latent variables are 
usually treated via variational techniques (T4||T9), evidence 
approximation 115 T5J, expectation maximization 117 T 8 ) 
and Markov chain Monte Carlo 0H3- 


Our objective in this paper is to develop new type II es¬ 
timation methods for LRMR. Borrowing ideas from type II 
estimation techniques for sparse reconstruction, such as the 
relevance vector machine and sparse Bayesian learning al¬ 
gorithms, we model a low-rank matrix by a multiplication 
of precision matrices and an i.i.d. Gaussian matrix. The 
use of precision matrices helps to realize low-rank structures. 
The precision matrices are characterized by hyper-parameters 
which are treated as latent variables. The main contributions 
of this paper are as follows. 


1. We introduce one-sided and two-sided precision matrix 
based models. 

2. We show how the Schatten s-norrn and log-determinant 
penalty functions are related to latent variable models in 
the sense of MAP estimation via type I estimator 0- 

3. For all new type II estimation methods, we derive up¬ 
date equations for all parameters in iterations. The meth¬ 
ods are based on evidence approximation and expectation- 
maximization. 

4. The methods are compared numerically to existing meth¬ 
ods, such as the Bayesian learning method of p~4) and 
nuclear norm based convex optimization method pi. 

We are aware that evidence approximation and expectation- 
maximization are unable to provide globally optimal solu¬ 
tions. Hence we are unable to provide performance guar¬ 
antees for our methods. This paper is organized as follows. 
We discuss the preliminaries of sparse Bayesian learning in 
section o In section [2] we introduce one-sided precisions 
for matrices and derive the relations to type I estimators. 
Two-sided precisions are introduced in section [3] and in sec¬ 
tion [4] we derive the update equations for the parameters. In 
section [5] we numerically compare the performance of the 
algorithms for matrix reconstruction and matrix completion. 


1.1. Preliminaries 


In this section, we explain the relevance vector machine 


(RVM) 

15 

16 

221 and sparse Bayesian learning (SBL) meth- 

ods [ 17 

18 

24 

for a standard sparse reconstruction problem. 


The setup is 

y = Ax + n, (3) 


where x £ M” is the sparse vector to be reconstructed from 
the measurement vector y £ R m and n ~ Af(0, 0~ 1 I m ) is 
the additive measurement noise. The approach is to model the 
sparse signal x = [x 1 X 2 ... x n ] T as 


-1/2 

Xi = 7i Ui 


(4) 


where it,; ~ Af(0, 1) and 7 * > 0 is the precision of Xj. This is 
equivalent to setting 

p{xihi) = Y^exp(-7jX-/2) =A/ r (xj|0,7 i -1 ). 

The main idea is to use a learning algorithm for which sev¬ 
eral precisions go to infinity, leading to sparse reconstruction. 
Alternatively said, the use of precisions allows to inculcate 
dominance of a few components over other components in a 
sparse vector. Note that 7 = [71 72 ■ • • 7n] T and 0 are latent 
variables that also need to be estimated. We find the posterior 


P(x|y) = / p(x|y, 7 , / 3 )p( 7 , /3|y) d~y d0 « p(x|y, 7 , /3), 


if p( 7,/3|y) is assumed sharply peaked around 7 , 0 (this is 
version of the so-called Laplace approximation described in 
Appcndix[T[). Assuming the knowledge of 7 = 7 and 0 = 0, 
the MAP estimate is 


X = [xr x 2 ... Xn] T £- arg maxp(x|y, 7 , 0) = /?£A T y (5) 

X 

where S = (diag( 7 ) + 0A T A) -1 . In the notion of iterative 
updates, we use 4 — to denote the assignment operator. The 
precisions 7 and 0 are estimated by 


{ r !i' ew , 0 new ) £- argmaxp(y, 7 ,/3) 
r/i,P 

= arg maxp(y I 7 , 0) p(j) p(0), 

n,P 


where£>( 7 ) = IIiP(7») and f(y|7, P) = A/"(y|0, (Adiag( 7 )A T + 
0~ 1 I m )~ 1 ). Gamma distributions are typically chosen as 
hyper-priors forp( 7 j) and p(0) with the form 

La+1 

p{0) = Gamma(/3|a + 1, b) = 0 a e~ bfi , 

1 (a + 1) 

with a > — 1, b > 0 and 0 > 0. The evaluation of 
{7i‘ ew , 0 new ) leads to coupled equations and are therefore 
solved approximately as 


7 i <- 


1 - 7jEjj + 2 a 

£2 + 2 b 


and 0 4 — 


ELi 7i s « + 2 a 

||y — Ax||2 + 26’ 


( 6 ) 























( 8 ) 



1 ) 


bution will be described later). This is equivalent to 

P(X|Q ° = (l^ eXP ( 4 tr ( xT « X )) ' 

Denoting Z = XX T and tr(X T aX) = tr(aXX T ) = 
tr(aZ), we evaluate 


Fig. 1. Illustration of different sparsity promoting penalty 
functions. 


where T,a is the i’th diagonal element of X. The parameters 
of the Gamma distributions for p( 7 ) and p{0) are typically 
chosen to be non-informative, i.e. a, b —> 0. The update solu¬ 
tions of ([5]) and ([ 6 ]) are repeated iteratively until convergence. 
In sparse Bayesian learning algorithm a standard expectation- 
maximization framework is used to estimate 7 and (3. 

Finally we mention that the RVM and SBL methods have 
connection with type I estimation p4](25) . If the precisions 
have arbitrary prior distributions p( 7 i), then the marginal dis¬ 
tribution of x, becomes 


ex e 


— h(xi)/2 


p{xi) = Jp{xi\li)pi'Ti) drfi 

Given Q and for a known /3, the 


for some function h( 
MAP estimate is 


x = argmin /3||y — Ax Wt + YM 


i=1 


If p( 7 ^) is a gamma prior then p(xi) is a Student-f distribution 
with h(xi) = constant x log(x 2 + constant). One rule of 
thumb is that a ’’more” concave h(xi) gives a more sparsity 
promoting model [24j, see some example functions in Fig¬ 
ure [T] In the figure, h{xi) = \x, corresponds to a Laplace 
distributed variable Xi, log(xf + 1) to a Student-f and \xi \ 1 / 2 
to a generalized normal distribution |23 |. The relation be¬ 
tween the sparsity promoting penalty function h(xi ) and the 
corresponding prior p("/i) of the latent variable 7 ; was dis¬ 
cussed in [24] , see also |26) and |27|. 


2. ONE-SIDED PRECISION BASED MODEL 

The structure of a low-rank matrix X is characterized by the 
dominant singular vectors and singular values. Like the use of 
precisions in |4} for the standard sparse reconstruction prob¬ 
lem via inculcating dominance, we propose to model the low- 
rank matrix X as 


X = oT 1/2 U, 


(7) 


where the components of U £ R. pxq are i.i.d. A/"(0,1) and 
a £ W xp is a positive definite random matrix (which distri- 


p(X) = / p(X.\a)p(a) da 

Jay 0 

| o ;| 9 / 2 


e -|tr(aZ) 


Jay 0 

ex e“5 £ 


(2?r )P9/ 2 


p(ct)da (9) 


oc e 2 “ 


We note that g(X) must have the special form p(X) = 
ff(XX T ) for p(X) 0 to hold (as a is integrated out). As 
p(X) oc e _29 ^ x \ the resulting MAP estimator can be inter¬ 
preted as the type I estimator ([ 2 ]). 

2.1. Relation between priors 

Next we investigate the relation between the priors p(X) and 
p(a). The motivation is that the relations are necessary for 
designing practical learning algorithms. From (|9j, we note 
that p(X) is the Laplace transform of |o:| 9 ' /2 p(a)/(27r) P9//2 
[28j|, which establishes the relation. Naturally, we can find 
p(a) by the inverse Laplace transform |28) as follows 


GO) 


P (a) ex |c*r 9 / 2 J e5 tr( “ Z) e-5 9(z) dZ, 


Re Z =a* 


where the integral is taken over all symmetric matrices Z £ 
C pxp such that Re Z = a, where a* is a real matrix so 
that the contour path of integration is in the region of con¬ 
vergence of the integrand. While the Laplace transform char¬ 
acterizes the exact relation between priors, the computation 
is non-trivial and often analytically intractable. In practice, 
a standard approach is to use the Laplace approximation p9) 
where typically the mode of the distribution under approxima¬ 
tion is found first and then a Gaussian distribution is modeled 
around that mode. Letp(a) have the formp(a) oc e~^ K ^ a ^; 
then the Laplace approximation becomes 

g(Z) = min {tr(aZ) — glog |a| + K(a)} 

<A >-0 

-log|H| + constant, 

where H is the Hessian of tr(aZ) — glog |a| + K(a) evalu¬ 
ated at the minima (which is assumed to exist). The derivation 
of the Laplace approximation is shown in Appendix [T| 

Denoting K(a ) = q log | a — K{ol) and assuming that 
the Hessian is constant (independent of Z) we get that 

g( Z) = min |tr(crZ) — A(q!)| , 






where we absorbed the constants terms into the normalization 
factor of p(X). We find that g(Z) is the concave conjugate of 
K(a) 1101. Hence, for a given g(Z ) we can recover K(a) as 

K(a) = min {tr(ctZ) — g(Z)j (11) 

if K(a ) is concave (which holds under the assumption that 
K(a) is convex). Further, we can find K(ot) from K{a) 
followed by solving the prior p{a) ex e~^ K ^ a K Using the 
concave conjugate relation ( fTT) , we now deal with the task of 
finding appropriate K(a) for two example low-rank promot¬ 
ing penalty functions, as follows. 

1. For Schatten s-norm: The Schatten s-norm based penalty 
function is g(X) = tr((XX T ) s / 2 ). We here use a regu¬ 
larized Schatten s-norm based penalty function as 

s(X) =tr((XX T + el p ) s / 2 ), (12) 

where the use of e > 0 helps to bring numerical stability 
to the algorithms in Section [4] For the penalty function 
m we find the appropriate K (a) as 


3. TWO-SIDED PRECISION BASED MODEL 

In this section, we propose to use precision matrices on both 
sides to model a random low-rank matrix. We call this the 
two-sided precision based model. Our hypothesis is that the 
two-sided precision helps to enhance dominance of a few sin¬ 
gular vectors. For low-rank modeling, we make the following 
ansatz 

X = a~ 1/2 Ua~ 1/2 (16) 

where a R £ R pxp and a R £ R. qxq are positive definite 
random matrices. Using the relation vec(X) = (a R ® 
a L 1 ^ 2 ) vec(U) = (a R ® a^) -1 / 2 vec(U), we find 

p(X\a L ,a R ) 

= ^ 0 ^/ 2 — exp (- vec ( X ) T ( a fl ® «L)vec(X)/2) 
= - ex P (-tr(X T a L Xa fl )/2) . (17) 


K{a) = C s tr(a 2 - s ) + <7 log |a| + etr(a), (13) 

where C s = (|) 2-5 • The derivation of ( fl3] i is 

given in Appendixp] Note that, for s = 1, g(X) becomes 
the regularized nuclear norm based penalty function 

min (p,q) 

ff(X) = tr((XX T + elp,) i) = £ MX ) 2 + e)*. 

2. Log-determinant penalty: For the log-determinant based 
penalty function 

<?(X) = v log |XX T + el p | , (14) 

where v > q — 2 is a real number, we find K{a ) as 

K(a) = etr(a) + (q — v) log |a|. (15) 

As p(cl) oc e~? K ( a \ we find that the prior a is Wishart 
distributed (Wishart is a conjugate prior the distribution 
|8]i). For a scalar instead of a matrix, the prior distribu¬ 
tion becomes a Gamma distribution as used in the stan¬ 
dard RVM and SBL. 

We have discussed a left-sided precision based model (|7]i 
in this section, but the same strategy can be easily extended 
to form a right-sided precision based model. Then a natural 
question arises, which model to use? Our hypothesis is that 
the user choice stems from minimizing the number of vari¬ 
ables to estimate. If the low-rank matrix is fat then the left¬ 
sided model should be used, otherwise the right-sided model. 
A further question arises on the prospect of developing a two 
sided precision based model, which is described in the next 
section. 


To promote low-rank, we use a prior distribution p{a R , a R ) = 
p(a.L)p(ot R ). The marginal distribution of X is 

p(X) = J p(X\a L ,a R )p(a L )p(a R )da R da L . (18) 

a L yo 

CtR>- 0 

We have noticed that the use of ( fl7| > in evaluating ( p~8] > does 
not bring out suitable connections between the resulting p(X) 
functions and the usual low-rank promoting g(X) functions 
(such as nuclear norm, Schatten s-norm and log-determinant). 
Thus it is non-trivial to establish a direct connection between 
p(X.\a. R , a R ) of ( fl7] > and the type I estimator of Q. 

Instead of a direct connection we can establish an indirect 
connection by an approximation. For a given ( a R , (3) and by 
marginalizing over <y. R , we have p(X.\a R ) oc e - 5ff( Xa « x ) 
and hence the corresponding type I estimator cost function is 

min/3||y - Avec(X)|| 2 +g(Xa fl X T ). (19) 

A similar cost function can be found for a given (a R ,(3) by 
marginalizing over a R . We discuss the roles of a R and a R 
in the next section. 

3.1. Interpretation of the precisions 

From we can see that the column and row vectors of X 
are in the range spaces of ajj 1 and ot R , respectively. 

Further let us interpret this in a statistical sense with the 
note that a skewed precision matrix comprises of correlated 
components. Let us denote the (i, j)th component of « j; 1 by 
\ct~ R 1 }ij- If oFpj 1 is highly skewed then and [c*^ 1 ]^ 




are highly correlated. Suppose x, G W‘ denotes the i’th col¬ 
umn vector of X. Then following (fTT], we can write 


Xj 


’ 0 


[ a fl ]** a L \- a R ]ij a L 

\ 

. x i . 

N \ 

0 

1 

[a R ]ji a L [a R ]jj a L 

) 


The above relation shows that a presence of highly skewed 
a . R 1 leads to the cross-correlation ' ] ?J - a J' between x, 
and Xj that is comparably strong to the auto-correlations 
[ a ~R\n a ~L l ■ We mention that a low-rank property can be 
established in a qualitative statistical sense by the presence 
of columns having strong cross-correlation. The one-sided 
precision based model can be seen as the two-sided model 
where a R = I q . Hence the one-sided precision based model 
is unable to capture information about cross-correlation be¬ 
tween columns of X. A similar argument can be made for the 
right sided precision based model where a.J ' = Ip- 

4. PRACTICAL ALGORITHMS 

Considering the potential of two-sided precision matrices, the 
optimal inference problem is 

max p(X,y, a L ,a R ,P) 

which is the MAP estimator for amenable priors and often 
connected with the type I estimator in |2]). Direct handling of 
the optimal inference problem is limited due to lack of ana¬ 
lytical tractability. Therefore various approximations are used 
to design practical algorithms which are also type II estima¬ 
tors. This section is dedicated to design new type II estima¬ 
tors via evidence approximation (as used by the RVM) and 
expectation-maximization (as used in SBL) approaches. 


4.1. Evidence approximation 

In the evidence approximation, we iteratively update the pa¬ 
rameters as 


X g- argmaxp(X|y, q £ , an, P), 

(20) 

p G- argmaxp(y, a L ,a R , P), 

(21) 

a L g- argmaxp(y, q £ , an,P)A 

OLL 1 

OiR G- argmaxp(y, an, P), \ 

OCR ) 

(22) 


The solution of ( [20] ) is the standard linear minimum mean 
square error estimator (LMMSE) as 


vec(X) G- j3S A T y, where X = ((a R ® a £ ) + pA T A) 


Using a standard approach (see equations (45) and (46) of 
01 or (7.88) of [31|), the solution of \2l \ can be found as 


Pi¬ 


rn + 2 a 

| |y - Avec(X)| || + tr(AEA T ) + 2b ' 


The standard RVM in | |T5) uses the different update rule {30) 

O _ tr((a fl ®q £ )S) + 2a 
||y — Avec(X)||| + 26’ 


which often improves convergence The update rule ( |23) > 
has the benefit over ( |24| i of having established convergence 
properties. In simulations we used the update rule ( |23] l since 
it improved the estimation accuracy. 

Finally we deal with (|22) as follows. 

1. For Schatten s-norm: Using the Schatten s-norm prior 
© gives us the update equations 


/. - T - \( s - 2)/2 

a L G- c s I^Xa^X + X £ + el P J 

/ - T - - \ (*-2)/2 

an G- c s (X a R X + Xn + el q J 


(25) 


where c s = (s/ 2) s / 2 and the matrices S £ and X R have 
elements 

|f)), 

= tr(E(E<f ® a L )), 

and where E ( ^ G R pxp and E-f^ G R 9 * 9 are matrices 

L J L J 

with ones in position (i. j) and zeros otherwise. 

2. Log-determinant penalty: For the log-determinant prior 
(fl5| the update equations become 


e*L "s - v + X £ + elp^j , 

OiR •<— y ^X T a £ X + 


(26) 


We see that the update rule for the log-determinant penalty 
can be interpreted as ( [25] > in the limit s —► 0. 

The derivations of ( |25[ and ( [26] ) are shown in Appendix [3] 
and [A] The corresponding update equations for the one-sided 
precision based model |7]) are obtained by fixing the other 
precision matrix to be the identity matrix. In the spirit of the 
evidence approximation based relevance vector machine, we 
call the developed algorithms in this section as relevance sin¬ 
gular vector machine (RSVM). For Schatten-s norm and log- 
determinant priors, the methods are named as RSVM-SN and 
RSVM-LD, respectively. 


4.2. EM 

In expectation-maximization © the value of the precisions 
6 = { a £ . an, 3} are updated in each iteration by maximiz¬ 
ing the cost (EM help function in MAP estimation) 


(23) 


Q(8,6') + logp(0) 


(27) 












where 9' are the parameter values from the previous iteration. 
The function Q(0. 9') is defined as 

Q(M') =£x|y,e'[logp(y,X|0)] = constant 

- f lly - Avec(X)||2 - ^r(a L Xa fl X T ) - itr^E') 

+ | logical + |log|ajj| + ^ log/3, (28) 

where = {[ol' r ® ot' L ) + ft A t A) \ and E denotes the 
expectation operator. The maximization of Q(9,9')+\ogp(9) 
leads to update equations which are identical to the update 
equations of evidence approximation. That means that for 
the Schatten-s norm, the maximization leads to (20), (23) and 
and for log-determinant penalty, the maximization leads 
to (20) , ([23) and (26) . For the noise precision, EM reproduces 
the update equation ( [23) . The derivation of ( [28) and update 
equations are shown in Appendix [2] Unlike evidence ap¬ 
proximation, EM has monotonic convergence properties and 
hence the derived update equations are bound to improve es¬ 
timation performance in iterations. 

4.3. Balancing the precisions 

We have found that in practical algorithms, there is a chance 
that one of the two precisions becomes large and the other 
small over iterations. A small precision results in numerical 
instability in the Kronecker covariance structure (T7) . To 
prevent the inbalance we rescale the matrix precisions in each 
iteration such that 1) the a-priori and a-posteriori squared 
Frobeniun norm of X are equal, 

£[||X|||,|a ij a Ji ] = tr(a^ 1 )tr(a^ 1 ) 

= £[\\X\\ 2 F \a L ,a R ,P,y] = ||X|| 2 F + tr(X), 

and 2) the contribution of the precisions to the norm is equal, 

=tr(a^ 1 ). 

The rescaling makes the algorithm more stable and often im¬ 
proves estimation performance. 

5. SIMULATION EXPERIMENTS 

In this section we numerically verify our two hypotheses, 
and compare the new algorithms with relevant existing algo¬ 
rithms. Our objectives are to verify: 

• the hypothesis that the left-sided precision is better than 
the right sided precision for a fat low-rank matrix, 

• the hypothesis that the two-sided precision based model 
performs better than one-sided precision based model. 

• the proposed methods perform better than a nuclear-norm 
minimization based convex algorithm and a variational 
Bayes algorithm. 


In the simulations we considered low-rank matrix reconstruc¬ 
tion and also matrix completion as a special case due to its 
popularity. 


5.1. Performance measure, experimental setup and com¬ 
peting algorithms 

To compare the algorithms, the performance measure is the 
normalized-mean-square-error 

NMSE = £[||X — X||p]/£[||X||^]. 


In experiments we varied the value of one parameter while 
keeping the other parameters fixed. For given parameter val¬ 
ues, we evaluated the NMSE as follows. 

1. For LRMR, the random measurement matrix A £ M mxp9 
was generated by independently drawing the elements 
from Af( 0,1) and normalizing the column vectors to unit 
norm. For low rank matrix completion, each row of A 
contains a 1 in a random position and zero otherwise with 
the constraint that the rows are linearly independent. 

2. Matrices L £ M pxr and R € R rX9 with elements drawn 
from 7V(0,1) were randomly generated and the matrix X 
was formed as X = LR. Note that X is of rank r (with 
probability 1). 

3. Generate the measurement y = Avec(X) + n, where 
n ~ A/”(0, I m ) and is chosen such that the signal- 
to-measurement-noise ratio is 


SMNR = 


g[||Avec(X)||j] 

£[ii n iii] 


rpq 

ma n " 


4. Estimate X using competing algorithms and calculate the 
error ||X - X|||.. 

5. Repeat steps 2 — 4 for each measurement matrix T\ num¬ 
ber of times. 

6. Repeat steps 1 — 5 for the same parameter values 7j num¬ 
ber of times. 

7. Then compute the NMSE by averaging. 

In the simulations we chose T) = T 2 = 25, which means 
that the averaging was done over 625 realizations. We nor¬ 
malized the column vectors of A to make the SMNR expres¬ 
sion realization independent. 

Finally we describe competing algorithms. For compari¬ 
son, we used the following nuclear norm based estimator 


X = argmin ||X| 


, s.t. ||y - Avec(X)|| 2 < e, 


where we used e = a n Vm + y/8m as proposed in 132 
The cvx toolbox 


was used to implement the estimator. 
For matrix completion we also compared with the Variational 
Bayesian (VB) developed by Babacan et. al. (14). In VB, the 
matrix X is factorized as 


X = LR, 






0 



Fig. 2. NMSE vs. ml ( pq ) for low-rank matrix reconstruction. 


and (block) sparsity inducing priors are used for the column 
vectors of L £ R P xmin (P>?) and gT e Rq xmin(p, q ) _ The yg 
algorithm was developed for matrix completion (and robust 
PCA), but not for matrix reconstruction. We note that, unlike 
RSVM and VB, the nuclear norm estimator requires a-priori 
knowledge of the noise power. We also compared the algo¬ 
rithms to the Cramer-Rao bound (CRB) from 1 1~2]|34| (as we 
know the rank a-priori in our experimental setup). We men¬ 
tion that the CRB is not always a valid lower bound in this 
experimental setup because all technical conditions for com¬ 
puting a valid CRB are not always fulfilled and the estimators 
are not always unbiased. The choice of CRB is due to absence 
of any other relevant theoretical bound. 

5.2. Simulation results 

Our first experiment is for verification of the first two hy¬ 
potheses. For the experiment, we considered LRMR and fixed 
rank(X) = r = 3, p = 15, q = 30, SMNR = 20 dB and 
varied to. The results are shown in Figure [2] where NMSE is 
plotted against normalized measurements m/(pq). We note 
that RSVM-SN with left precision is better than right preci¬ 
sion. Same result also hold for RSVM-LD. This verifies the 
first hypothesis. Further we see that RSVM-SN and RSVM- 
LD with two sided precisions are better than respective one¬ 
sided precisions. This result verifies the second hypothesis. 
In the experiments we used s = 0.5 for RSVM-SN as it was 
found to be the best (empirically). Henceforth we fix s = 0.5 
for RSVM-SN. 

The second experiment considers comparison with nuclear- 
norm based algorithm and the CRB for LRMR. The objective 
is robustness study by varying number of measurements and 
measurement noise power. We used r = 3, p = 15 and 
q = 30. In Figure [3] (a) we show the performance against 
varying m/(pq)', the SMNR = 20 dB was fixed. The perfor¬ 
mance improvement of RSVM-SN is more pronounced over 




Fig. 3. NMSE vs. m/(pq) and SMNR for low-rank matrix 
reconstruction, (a) SMNR = 20 dB and m/(pq) is varied, (b) 
m/(pq) = 0.7 and SMNR is varied. 


the nuclear-norm based algorithm in the low measurement 
region. Now we fix m/(pq) = 0.7 and vary the SMNR. The 
results are shown in Figure [3] (b) which confirms robustness 
against measurement noise. 

Next we deal with matrix completion where the measure¬ 
ment matrix A has a special structure and considered to be 
inferior to hold information about X than the same dimen¬ 
sional random measurement matrix used in LRMR. Therefore 
matrix completion requires more measurements and higher 
SMNR. We performed similar experiments as in our second 
experiment and the results are shown in Figure [4] In the ex¬ 
periments the performance of the VB algorithm is included. 
It can be seen that RSVM-SN is typically better than the other 
algorithms. We find that the the VB algorithm is pessimistic. 

Finally in our last experiment we investigated the VB al¬ 
gorithm to find conditions for its improvement and compared 
it with RSVM-SN. For this experiment, we fixed r = 3, 
p = 15, m/(pq) = 0.7 and SMNR = 20 dB, and varied q. 
The results are shown in Figure [5] and we see that VB pro¬ 
vides good performance when p = q. The result may be at¬ 
tributed to an aspect that VB is highly prone to a large number 
of model parameters which arises in case X is away from a 
square matrix. 

6. CONCLUSION 

In this paper we developed Bayesian learning algorithms for 
low-rank matrix reconstruction. The framework relates low- 
rank penalty functions (type I estimators) to the latent variable 
models (type II estimators) with either left- or right-sided pre¬ 
cisions through the matrix Laplace transform and the concave 
conjugate formula. The model was further extended to the 
two-sided precision based model. Using evidence approx- 
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Fig. 4. NMSE vs. m/(pq) and SMNR for low-rank matrix 
completion, (a) SMNR = 20 dB and m/{pq) is varied, (b) 
m/ ( pq ) =0.7 and SMNR is varied. 


imation and expectation maximization, we derived the up¬ 
date equations for the parameters. The resulting algorithm 
was named the Relevance Singular Vector Machine (RSVM) 
due to its similarity with the Relevance Vector Machine for 
sparse vectors. Especially we derived the update equations for 
the estimators corresponding to the log-determinant penalty 
and the Schatten s-norm penalty, we named the algorithms 
RSVM-LD and RSVM-SN, respectively. 

Through simulations, we showed that the two-sided pre¬ 
cision based model performs better than the one-sided model 
for matrix reconstruction. The algorithm also outperformed a 
nuclear-norm based estimator, even though the nuclear-norm 
based estimator knew the noise power. The proposed meth¬ 
ods also outperformed a variational Bayes method for matrix 
completion when the matrix is not square. 

[Derivations] 

.1. Derivation of the Laplace Approximation 


Fig. 5. NMSE vs. q for low-rank matrix completion. 


around a minima. With this approximation, the integral be¬ 
comes 


/ ~ J g-|/( a o)—3(a~ao) T H(a-a 0 )^ a _ ^ i.^) c ~^/(ao/ 

In <|9j, the integral is given by 


/ = 


1 


(2tt)P9/ 2 


=-5[t r (“ z )-9log|«l+*'(“)]^ ai 


'ay 0 


Set /(a) = tr(aZ) — qlog |a| + K(a), where a = vec(a). 
Let ao >- 0 denote the minima of /(a) and H the Hessian at 
O! 0 . Assuming that « () and H are “large” in the sense that the 
integral over a >- 0 can be approximated by the integral over 
ol G M pxp we find that 


(2tt)P9/ 2 
(4tt)p 2 / 2 

(2tt)p9/ 2 |H| 1 /2 


3 -l/(a 0 )—i(a—a 0 ) T H(a—a 0 )^ a 


e -i/(a 0 ) 


where ao = vec(c*o). 


The Laplace approximation is an approximation of the inte¬ 
gral 


I = J e-s /(a) da, 

where the integral is over a e R". The function /(a) is ap¬ 
proximated by a second order polynomial around its minima 
ao as 


/(a) « /(ao) + i(a — a 0 ) T H(a- a 0 ), 

where H = V 2 /(a)| a=ao is the Hessian of /(a) at a 0 . The 
term linear in a vanishes and H >- 0 at ao since we expand 


.2. The EM help function 

The EM help function Q(0.9') is given by 


771 

<2(0,0') = £x|y ,6' [log p(X|y, 6 )\ = c + — log /3 
- Y^ [||y - Avec(X)|| 2 - ^[tr(a L Xa fl X T )] 
+ ^ log \ocl\ + |log|a fl |, 
where c is a constant. Using that 

£[\\y~ Avec(X)|| 2 ] = ||y|| 2 - 2y T Avec(X) 
+ tr(A T A(vec(X)vec(X) T + E')) 

= ||y-Avec(X)|| 2 +tr(A T AE') ) 





























and 
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£[tr(a L Xa fl X T )] 
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